-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce APIError for Sync Calls from CLI to Orchestrator Node #4366
Conversation
@@ -176,7 +177,7 @@ func (b *BoltJobStore) getJob(tx *bolt.Tx, jobID string) (models.Job, error) { | |||
|
|||
data := GetBucketData(tx, NewBucketPath(BucketJobs, jobID), SpecKey) | |||
if data == nil { | |||
return job, bacerrors.NewJobNotFound(jobID) | |||
return job, apimodels.NewJobNotFound(jobID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't sound right that internal and backend components depend on apimodels
which are meant for the http API between frontend (api server) and clients.
Some suggestions:
- We should be able to define internal errors separate from the APIs, and then implement interceptors that can translate internal errors to http errors
- At the same time we should avoid having to translate each individual error to http format. So maybe having an interface that each of these errors implement, or a common underlying error type can simplify the translation
- We should avoid having a central pkg that defines all internal errors in the system like what we have today with bacerror
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, please let me know your thoughts about it.
pkg/publicapi/apimodels/error.go
Outdated
// httpstatuscode is the http status code associated with this error. | ||
// it should correspond to standard http status codes (e.g., 400, 404, 500). | ||
HTTPStatusCode int `json:"code"` | ||
|
||
// message is a short, human-readable description of the error. | ||
// it should be concise and provide a clear indication of what went wrong. | ||
Message string `json:"message"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also return:
- RequestID to help debug and grep server logs for the failed request
- ErrorCode that uniquely identify the error error without having to parse the message, similar to AWS
Keep in mind that some of this information are already in the http headers, such as status code and request id. It might be redundant to serialize them in the APIError. Maybe we should omit serializing them, and just have the client set those values from the headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the RequestID. However, adding a bacalhau specific error code is tricky at this point. Mainly because we are need to make sure that we properly propagate error codes correctly every where. In addition, we need to think about how to formulate error codes for errors that occur before even we reach to bacalhau specific components, for example echo server middlewares.
I think we should pick that up in follow ups. Let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah please feel free to do this in a follow up PR. What I had in mind is something similar to the code below. The idea is to have standard error codes that can be used by our core components and even the modules if the errors apply, while still enabling the modules to define their own specific error code, such as BucketNotFound
and ContainerStartFailed
The HTTPCode
can be explicitly provided, inferred from the error code if it is a standard one, or fallback to 500.
What I really liked in your PR is the introduction of Component
field which helps identify the source of the error and also removes the need for these error codes to be globally unique. I would just make the components more readable and descriptive, such as S3Publisher
, Docker
instead of abbreviations. To simplify things, I am proposing the default component to be Bacalhau
, which doesn't mean much, but will allow us to incrementally improve the coverage of better errors.
Also as we progress, we should allow components to override the Component
piece of the errors. Meaning a job not found error with JobStore
component doesn't help much compared to knowing the component that actually called the job store. We don't need to do these changes today, but just thinking ahead.
What do you think?
const (
BadRequest ErrorCode = "BadRequest"
Unauthorized ErrorCode = "Unauthorized"
ResourceNotFound ErrorCode = "ResourceNotFound"
ResourceInUse ErrorCode = "ResourceInUse"
ValidationFailed ErrorCode = "ValidationFailed"
VersionMismatch ErrorCode = "VersionMismatch"
TooManyRequests ErrorCode = "TooManyRequests"
InternalError ErrorCode = "InternalError"
NotImplemented ErrorCode = "NotImplemented"
ServiceUnavailable ErrorCode = "ServiceUnavailable"
DatastoreFailure ErrorCode = "DatastoreFailure"
NetworkFailure ErrorCode = "NetworkFailure"
ConfigurationError ErrorCode = "ConfigurationError"
ResourceExhausted ErrorCode = "ResourceExhausted"
)
// Note: Component-specific error codes should be defined in their respective packages.
// Examples might include:
// - BucketNotFound
// - ObjectNotFound
// - InvalidCredentials
// - ImageNotFound
// - ContainerStartFailed
// - PeerConnectionFailed
type BaseError struct {
Code ErrorCode `json:"code"`
Message string `json:"message"`
Component string `json:"component"`
HTTPCode int `json:"http_code"`
Hint string `json:"hint,omitempty"`
Retryable bool `json:"retryable"`
FailsExecution bool `json:"fails_execution"`
Details map[string]string `json:"details,omitempty"`
}
func NewBaseError(format string, a ...any) *BaseError {
return &BaseError{
Code: InternalError,
Message: fmt.Sprintf(format, a...),
Component: "Bacalhau",
HTTPCode: 0, // Will be inferred later if not set explicitly
}
}
func (e *BaseError) Error() string {
return fmt.Sprintf("[%s] %s: %s", e.Code, e.Component, e.Message)
}
func (e *BaseError) WithErrorCode(code ErrorCode) *BaseError {
e.Code = code
return e
}
func (e *BaseError) WithHTTPCode(httpCode int) *BaseError {
e.HTTPCode = httpCode
return e
}
func (e *BaseError) HTTPStatusCode() int {
if e.HTTPCode != 0 {
return e.HTTPCode
}
return inferHTTPStatusCode(e.Code)
}
func inferHTTPStatusCode(code ErrorCode) int {
switch code {
case BadRequest, ValidationFailed:
return http.StatusBadRequest
case Unauthorized:
return http.StatusUnauthorized
case ResourceNotFound:
return http.StatusNotFound
case ResourceInUse, VersionMismatch:
return http.StatusConflict
case TooManyRequests:
return http.StatusTooManyRequests
case NotImplemented:
return http.StatusNotImplemented
case ServiceUnavailable:
return http.StatusServiceUnavailable
default:
return http.StatusInternalServerError
}
}
// rest of the methods
pkg/jobstore/errors.go
Outdated
func NewBoltDbError(message string) *models.BaseError { | ||
return models.NewBaseError(message).WithCode(models.NewErrorCode(BOLTDB_COMPONENT, 500)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this should fit in boltdb pkg as it is related to the job store implementation. I understand I asked to define and use errors mainly in the jobstore
pkg, but that should be for common errors that apply to different implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pkg/publicapi/apimodels/error.go
Outdated
|
||
// RequestID is the request ID of the request that caused the error. | ||
RequestID string `json:"request_id"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you might want to define RequestID and HTTPStatusCode as transient fields so you can avoid serializing them and then populate them in the client from the headers. This is doable by using json:-
tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
func isDebugMode() bool { | ||
return os.Getenv("BACALHAU_DEBUG") == "true" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have some logic in enabling debug mode in our http server, but they are not consistent yet, and maybe we should unify them and avoid adding a third mode
bacalhau/pkg/publicapi/server.go
Lines 93 to 101 in 960ceb1
// enable debug mode to get clearer error messages | |
// TODO: disable debug mode after we implement our own error handler | |
server.Router.Debug = true | |
// set middleware | |
logLevel, err := zerolog.ParseLevel(params.Config.LogLevel) | |
if err != nil { | |
return nil, err | |
} |
I would say we can just rely on the LogLevel
configuration, and if it is debug
or trace
, then we set Router.Debug = true
In your error handler you can then do if c.Echo.Debug == true
to check if debug mode is enabled in the router
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pkg/jobstore/boltdb/store.go
Outdated
@@ -190,7 +191,7 @@ func (b *BoltJobStore) reifyJobID(tx *bolt.Tx, jobID string) (string, error) { | |||
if idgen.ShortUUID(jobID) == jobID { | |||
bktJobs, err := NewBucketPath(BucketJobs).Get(tx, false) | |||
if err != nil { | |||
return "", err | |||
return "", jobstore.NewBoltDbError(err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would suggest to pass the actual error instead of just the string. This will allow NewBoltDbError to intercept the error and return different types or codes based on the actual error.
My not be needed here or now, but I am just thinking ahead when we capture and throw errors for docker executor and s3 publisher for example. You might find this pattern useful to centralize intercepting errors per component
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
- introduced by #4366, we need to close the body only if there isn't an error, else the body is nil and panics. Mentioned here #4401 (comment) Co-authored-by: frrist <[email protected]> Co-authored-by: Walid Baruni <[email protected]>
This PR aims at introducing APIError, which is responsible for encapsulating all Synchronous error for requestes from CLI to Orchestrator Node.
We introduce a CustomHttpErrorHandler, which is responsible for translating errors to correct response status code. For example if a job does not exist, we should return 404 and not 500.
in addition, to this we also make sure that CLI outputs are clean outputs and not involving the HTTP Response codes. There is not value in exposing HTTP Status Codes on CLI as it adds no values.
CLI Outputs Before this PR Change
CLI Outputs After this PR Change
closes #4234